A Neural Model for Language Identification in Code-Switched Tweets
نویسندگان
چکیده
Language identification systems suffer when working with short texts or in domains with unconventional spelling, such as Twitter or other social media. These challenges are explored in a shared task for Language Identification in Code-Switched Data (LICS 2016). We apply a hierarchical neural model to this task, learning character and contextualized word-level representations to make word-level language predictions. This approach performs well on both the 2014 and 2016 versions of the shared task.
منابع مشابه
Multilingual Code-switching Identification via LSTM Recurrent Neural Networks
This paper describes the HHU-UH-G system submitted to the EMNLP 2016 Second Workshop on Computational Approaches to Code Switching. Our system ranked first place for Arabic (MSA-Egyptian) with an F1-score of 0.83 and second place for Spanish-English with an F1-score of 0.90. The HHU-UHG system introduces a novel unified neural network architecture for language identification in code-switched tw...
متن کاملDeveloping Language-tagged Corpora for Code-switching Tweets
Code-switching, where a speaker switches between languages mid-utterance, is frequently used by multilingual populations worldwide. Despite its prevalence, limited effort has been devoted to develop computational approaches or even basic linguistic resources to support research into the processing of such mixedlanguage data. We present a user-centric approach to collecting code-switched utteran...
متن کاملEstimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique
Word-level language detection is necessary for analyzing code-switched text, where multiple languages could be mixed within a sentence. Existing models are restricted to code-switching between two specific languages and fail in real-world scenarios as text input rarely has a priori information on the languages used. We present a novel unsupervised word-level language detection technique for cod...
متن کاملA Language Modeling Approach to Identifying Code-Switched Sentences and Words
Globalization and multilingualism contribute to code-switching – the phenomenon in which speakers produce utterances containing words or expressions from a second language. Processing code-switched sentences is a significant challenge for multilingual intelligent systems. This study proposes a language modeling approach to the problem of codeswitching language processing, dividing the problem i...
متن کاملWord-level Language Identification in Bi-lingual Code-switched Texts
Code-switching is the practice of moving back and forth between two languages in spoken or written form of communication. In this paper, we address the problem of word-level language identification of code-switched sentences. Here, we primarily consider Hindi-English (Hinglish) code-switching, which is a popular phenomenon among urban Indian youth, though the approach is generic enough to be ex...
متن کامل